Steps for exploring a data frame

  1. Understand the structure of the data

  2. Take a look at the data

  3. Visualize the data

Bring rectangular data in

In hw02, we are going to work with the gapminder data and dplyr (probably via the tidyverse meta-package). Install them if you have not done so already. I had already installed the packages, so I commented out the commands.

#install.packages("gapminder")
#install.packages("tidyverse")
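
As an alternative to commenting the install commands in and out, the check can be made conditional. This is only a sketch: `install_if_missing` is a hypothetical helper name, and the CRAN mirror URL is an assumption you may want to change.

```r
# Hypothetical helper: install a package only when it is not already available.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  invisible(pkg)
}

# Check both packages used in this report.
for (pkg in c("gapminder", "tidyverse")) {
  install_if_missing(pkg)
}
```

With this in place the script can be rerun on a fresh machine without editing it first.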

Load them.

library(gapminder)
library(tidyverse)
## ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ─ Conflicts ───────────────────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Smell test data

The purpose of this part is to explore the gapminder object.

1, Is it a data.frame, a matrix, a vector, a list?

mode(gapminder)
## [1] "list"
typeof(gapminder)
## [1] "list"

Once I had resolved my confusion about the difference between mode and typeof, I learned that both report the type or storage mode of an object, but the names they use can differ.

Modes have the same set of names as types (see typeof) except that

  • types “integer” and “double” are returned as “numeric”.

  • types “special” and “builtin” are returned as “function”.

  • type “symbol” is called mode “name”.

  • type “language” is returned as “(” or “call”.

From R Documentation

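According to the documentation quoted above, mode and typeof give the same result, "list", for gapminder. This only describes how the object is stored: a data frame is stored as a list of columns, so the class of the object (next question) is what tells us it is a data.frame.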
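
To answer the question more directly than mode and typeof do, the base R predicate functions can be applied to the object. This is a small sketch using only base R and the gapminder package; the expected results in the comments follow from the class output shown in this report.

```r
library(gapminder)

is.data.frame(gapminder)       # TRUE: it is a data frame
is.list(gapminder)             # TRUE: a data frame is stored as a list of columns
is.matrix(gapminder)           # FALSE
is.vector(gapminder)           # FALSE: it carries attributes other than names
inherits(gapminder, "tbl_df")  # TRUE: more specifically, it is a tibble
```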

2, What is its class?

class(gapminder)
## [1] "tbl_df"     "tbl"        "data.frame"

3, How many variables/columns?

ncol(gapminder)
## [1] 6

4, How many rows/observations?

nrow(gapminder)
## [1] 1704

5, Can you get these facts about “extent” or “size” in more than one way? Can you imagine different functions being useful in different contexts?

Q3 and Q4 give the number of columns and the number of rows separately. dim returns both at once and is enough when you only need the dimensions, while str is more useful when you also care about the data type of each variable and want a preview of the data it contains.

dim(gapminder)
## [1] 1704    6
  • Another method
# Shows the dimensions of the data frame, then the name of each variable followed by its data type and a preview of its values.
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

6, What data type is each variable?

head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
  • Another method
# Returns a list of the same length as 'gapminder'; each element is the result of applying class() to the corresponding column of 'gapminder'.
lapply(gapminder,class)
## $country
## [1] "factor"
## 
## $continent
## [1] "factor"
## 
## $year
## [1] "integer"
## 
## $lifeExp
## [1] "numeric"
## 
## $pop
## [1] "integer"
## 
## $gdpPercap
## [1] "numeric"
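
For a more compact answer, sapply can be used in place of lapply: because every column here has a single class, it simplifies the result to a named character vector. The sketch below also calls dplyr::glimpse, which prints one line per column with its type; the variable name `col_classes` is only illustrative.

```r
library(gapminder)

# One class per column, simplified to a named character vector.
col_classes <- sapply(gapminder, class)
col_classes

# One line per column: name, type abbreviation and the first few values.
dplyr::glimpse(gapminder)
```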

Explore individual variables

Pick at least one categorical variable and at least one quantitative variable to explore.

Explore categorical variable continent
  • What are possible values (or range, whichever is appropriate) of each variable?

  • Feel free to use summary stats, tables, figures. We’re NOT expecting high production value (yet).

  • What values are typical? What’s the spread? What’s the distribution? Etc., tailored to the variable at hand.

Having identified the data type of each variable, I picked continent as the categorical variable and pop as the quantitative variable. I start with continent.

First, to access the levels attribute of the variable, I used levels, which returns the levels of its argument.

levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

To get the distinct values of the variable continent, I also used unique.

unique(gapminder$continent)
## [1] Asia     Europe   Africa   Americas Oceania 
## Levels: Africa Americas Asia Europe Oceania

Next, summary gives the number of observations (rows) for each level, and table gives the same counts, which can then be converted to proportions.

summary(gapminder$continent) %>% 
    knitr::kable()
             x
Africa     624
Americas   300
Asia       396
Europe     360
Oceania     24

continent.counts <- table(gapminder$continent)
continent.counts
## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24
continent.prop <- continent.counts / sum(continent.counts)
continent.prop
## 
##     Africa   Americas       Asia     Europe    Oceania 
## 0.36619718 0.17605634 0.23239437 0.21126761 0.01408451

Note that these counts are numbers of observations (rows), not numbers of countries: each country appears once per survey year. A barplot displays the counts of the categorical variable.

barplot(continent.counts, col = cm.colors(length(continent.counts)),
        xlab = "continent", ylab = "number of observations",
        ylim = c(0, 800), main = "Number of observations in each continent")
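
Since the table counts rows rather than countries, the number of distinct countries per continent has to be computed separately. The following is a sketch using dplyr's group_by, summarise and n_distinct; the names `country_counts`, `n_obs` and `n_countries` are chosen here for illustration.

```r
library(gapminder)
library(dplyr)

# Count rows and distinct countries within each continent.
country_counts <- gapminder %>%
  group_by(continent) %>%
  summarise(n_obs = n(),
            n_countries = n_distinct(country))

country_counts
```

The `n_countries` column should sum to 142, the number of levels of country reported by str, while `n_obs` reproduces the counts from table.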

To get a more direct overview of each continent's share, I converted the counts to proportions and visualized them in a pie chart.

lab <- levels(gapminder$continent)
# Percentage of observations per continent, rounded to one decimal place
piepercent <- round(100*continent.prop,1)
pie(continent.counts,labels = piepercent, 
    main="Pie chart of the proportion of observations in each continent",
    col = terrain.colors(length(continent.counts)))
legend("topright",lab,cex=0.7,
       fill = terrain.colors(length(continent.counts)))

Explore quantitative variable pop
  • What are possible values (or range, whichever is appropriate) of each variable?

  • What values are typical? What’s the spread? What’s the distribution? Etc., tailored to the variable at hand.

  • Feel free to use summary stats, tables, figures. We’re NOT expecting high production value (yet).

For the quantitative variable pop, it is useful to start with range, which gives the minimum and maximum, and summary, which gives the minimum, first quartile, median, mean, third quartile and maximum.

  range(gapminder$pop)
## [1]      60011 1318683096
  summary(gapminder$pop)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 6.001e+04 2.794e+06 7.024e+06 2.960e+07 1.959e+07 1.319e+09

To preview the first and last five values of the pop variable:

head(gapminder$pop,n=5)
## [1]  8425333  9240934 10267083 11537966 13079460
tail(gapminder$pop,n=5)
## [1]  9216418 10704340 11404948 11926563 12311143
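
The summary above shows a mean (2.960e+07) well above the median (7.024e+06), which suggests a right-skewed variable. To describe the spread more directly, a few base R functions can be added. This is a sketch; the choice of the 5% and 95% quantiles and the variable name `pop_values` are arbitrary.

```r
library(gapminder)

pop_values <- gapminder$pop

sd(pop_values)                               # standard deviation
IQR(pop_values)                              # 3rd quartile minus 1st quartile
quantile(pop_values, probs = c(0.05, 0.95))  # range covering the middle 90%

# Because the values span several orders of magnitude,
# a summary on the log10 scale is easier to read.
summary(log10(pop_values))
```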

Check the distribution of the pop variable

gapminder %>%
  ggplot(aes(x=pop)) +
  geom_histogram(bins=30) +
  scale_x_log10()

Combination of histogram and density plot

gapminder %>%
  ggplot(aes(pop)) +
  geom_histogram(aes(y=..density..),bins=30) +
  geom_density(alpha=0.2,fill='blue') +
  scale_x_log10()

Explore various plot types

Make a few plots, probably of the same variable you chose to characterize numerically. You can use the plot types we went over in class (cm006) to get an idea of what you’d like to make. Try to explore more than one plot type. Just as an example of what I mean:

A scatterplot of two quantitative variables.
A plot of one quantitative variable. Maybe a histogram or densityplot or frequency polygon.
A plot of one quantitative variable and one categorical. Maybe boxplots for several continents or countries.

You don’t have to use all the data in every plot! It’s fine to filter down to one country or small handful of countries.

We can explore the relationship between population and year in each continent

ggplot(gapminder,aes(x = continent,y = pop , color = year)) + 
  scale_y_log10() +
  geom_jitter(alpha = 0.5) +
  geom_violin(alpha = 0.1) +
  labs(title = "Jitterplot Combined with violinplot of population in each continent by year")

From this plot, populations in Asia span a wider range and reach higher values than those in the other continents, while Oceania has the fewest points in every year.

Next, I use a scatterplot to display the relationship between gdpPercap and lifeExp, with the points colored by pop.

ggplot( gapminder, aes(x=gdpPercap , y=lifeExp, color=pop)) + 
  geom_point(size=1,alpha=0.3) + 
  scale_color_distiller(palette = "RdPu")

From the output plot, lifeExp appears to increase with gdpPercap. Population might follow a similar trend, but that is harder to judge from the color scale.

I will now compare the populations of the continents in the year 1977.

d <- gapminder %>%
  filter(year==1977)

ggplot(d, aes(x=continent, y=pop, fill=continent)) + 
    geom_boxplot(alpha=0.3) +
    scale_y_log10() 

The boxplot suggests that in 1977 populations in Asia are generally higher, and span a wider range, than those in the other continents.
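
The impression from the boxplot can be checked numerically with a grouped summary. This is a sketch using dplyr; the column names `min_pop`, `median_pop` and `max_pop` are chosen here for illustration, and no particular values are claimed.

```r
library(gapminder)
library(dplyr)

pop_1977 <- gapminder %>%
  filter(year == 1977) %>%
  group_by(continent) %>%
  summarise(n = n(),
            min_pop = min(pop),
            # as.numeric() keeps the median a double in every group
            median_pop = median(as.numeric(pop)),
            max_pop = max(pop))

pop_1977
```

Comparing `median_pop` and the gap between `min_pop` and `max_pop` across rows gives the same information as the box positions and whisker lengths.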

Use filter(), select() and %>%

Use filter() to create data subsets that you want to plot.

Practice piping together filter() and select(). Possibly even piping into ggplot().

In this part, I use the plotly library to get an interactive version of the plot.

#install.packages("plotly")
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
a <- gapminder %>%
  select(-country) %>%
  filter(year==1967) %>%
  ggplot( aes(lifeExp,gdpPercap,size = pop, color=continent)) +
  geom_point() +
  scale_y_log10() +
  theme_bw()
 
ggplotly(a)

But I want to do more!

Evaluate this code and describe the result. Presumably the analyst’s intent was to get the data for Rwanda and Afghanistan. Did they succeed? Why or why not? If not, what is the correct way to do this? filter(gapminder, country == c("Rwanda", "Afghanistan"))

Read [What I do when I get a new data set as told through tweets](https://simplystatistics.org/2014/06/13/what-i-do-when-i-get-a-new-data-set-as-told-through-tweets/) from [SimplyStatistics](https://simplystatistics.org/) to get some ideas!

Present numerical tables in a more attractive form, such as using `knitr::kable()`.

Use more of the dplyr functions for operating on a single table.

Adapt exercises from the chapters in the “Explore” section of [R for Data Science](http://r4ds.had.co.nz/) to the Gapminder dataset.

To verify whether the code is correct, just run it:

filter(gapminder, country == c("Rwanda", "Afghanistan"))
## # A tibble: 12 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1957    30.3  9240934      821.
##  2 Afghanistan Asia       1967    34.0 11537966      836.
##  3 Afghanistan Asia       1977    38.4 14880372      786.
##  4 Afghanistan Asia       1987    40.8 13867957      852.
##  5 Afghanistan Asia       1997    41.8 22227415      635.
##  6 Afghanistan Asia       2007    43.8 31889923      975.
##  7 Rwanda      Africa     1952    40    2534927      493.
##  8 Rwanda      Africa     1962    43    3051242      597.
##  9 Rwanda      Africa     1972    44.6  3992121      591.
## 10 Rwanda      Africa     1982    46.2  5507565      882.
## 11 Rwanda      Africa     1992    23.6  7290203      737.
## 12 Rwanda      Africa     2002    43.4  7852401      786.

Still not sure, so let's run it separately for each country:

filter(gapminder, country == c("Rwanda"))
## # A tibble: 12 x 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda  Africa     1952    40   2534927      493.
##  2 Rwanda  Africa     1957    41.5 2822082      540.
##  3 Rwanda  Africa     1962    43   3051242      597.
##  4 Rwanda  Africa     1967    44.1 3451079      511.
##  5 Rwanda  Africa     1972    44.6 3992121      591.
##  6 Rwanda  Africa     1977    45   4657072      670.
##  7 Rwanda  Africa     1982    46.2 5507565      882.
##  8 Rwanda  Africa     1987    44.0 6349365      848.
##  9 Rwanda  Africa     1992    23.6 7290203      737.
## 10 Rwanda  Africa     1997    36.1 7212583      590.
## 11 Rwanda  Africa     2002    43.4 7852401      786.
## 12 Rwanda  Africa     2007    46.2 8860588      863.
filter(gapminder, country == c("Afghanistan"))
## # A tibble: 12 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## 11 Afghanistan Asia       2002    42.1 25268405      727.
## 12 Afghanistan Asia       2007    43.8 31889923      975.

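Comparing the outputs, the first call returned only 12 of the 24 rows, so the analyst did not succeed. The == operator compares element by element and recycles the shorter vector, so c("Rwanda", "Afghanistan") is repeated down the rows: the first row is compared with "Rwanda", the second with "Afghanistan", the third with "Rwanda" again, and so on. Only rows whose country happens to match the value at that position are kept, which is why every other year is missing for each country. The %in% operator does value matching instead: it returns a logical vector indicating whether each element of its first argument has a match in its second. Fixed version: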

filter(gapminder, country %in% c("Rwanda", "Afghanistan"))
## # A tibble: 24 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 14 more rows

According to the analysis above, this output is correct: all 24 rows (12 per country) are returned.
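
The difference between == and %in% is easier to see on a small vector than on the full data set. The following sketch uses made-up vectors `country` and `target` that mimic the situation above.

```r
# Four rows: two per country, in the same order as in gapminder.
country <- c("Afghanistan", "Afghanistan", "Rwanda", "Rwanda")
target  <- c("Rwanda", "Afghanistan")

# == recycles target to Rwanda, Afghanistan, Rwanda, Afghanistan
# and compares position by position.
country == target
# FALSE  TRUE  TRUE FALSE

# %in% asks, for each element of country, whether it occurs anywhere in target.
country %in% target
# TRUE TRUE TRUE TRUE
```

With ==, half of the rows are dropped even though every row belongs to one of the two countries, which is exactly the pattern seen in the 12-row output.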